[V1][CUDA Graph] Fix attention metadata tensor sizes for padded batches #24002
ayushsatyam146 wants to merge 35 commits into vllm-project:main from
Conversation
Code Review
This pull request addresses a critical crash issue related to CUDA graph padding by ensuring attention metadata tensor sizes match the execution environment. The changes are well-targeted and logical. A new num_input_tokens parameter is introduced to _prepare_inputs to carry the padded token count. The call to _prepare_inputs is correctly moved to after padding is calculated in execute_model. The slot_mapping and num_actual_tokens are now correctly sized using this padded count. The implementation appears robust and correctly handles the backward compatibility case where no padding is applied. Overall, this is a solid fix.
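For orientation, here is a minimal sketch of the call ordering the review describes, under simplified signatures. `_prepare_inputs`, `execute_model`, `num_input_tokens`, `slot_mapping`, and `num_actual_tokens` come from the PR itself; the class names, the padding helper, and the slot filler are hypothetical stand-ins, not vLLM's actual internals:

```python
from dataclasses import dataclass

import torch


@dataclass
class AttnMetadataSketch:
    num_actual_tokens: int
    slot_mapping: torch.Tensor


class ModelRunnerSketch:
    def __init__(self, cudagraph_batch_sizes: list[int]):
        self.cudagraph_batch_sizes = sorted(cudagraph_batch_sizes)

    def _pad_for_cudagraph(self, num_tokens: int) -> int:
        # Round up to the nearest captured graph size; if no captured graph
        # is large enough, fall back to the unpadded count (eager execution).
        for size in self.cudagraph_batch_sizes:
            if size >= num_tokens:
                return size
        return num_tokens

    def _prepare_inputs(self, num_scheduled_tokens: int,
                        num_input_tokens: int) -> AttnMetadataSketch:
        # Tensors are sized with the *padded* count so they match the shapes
        # a captured CUDA graph will replay; when no padding is applied,
        # num_input_tokens == num_scheduled_tokens and behavior is unchanged.
        slot_mapping = torch.full((num_input_tokens,), -1, dtype=torch.int64)
        slot_mapping[:num_scheduled_tokens] = 0  # stand-in for real slot ids
        return AttnMetadataSketch(num_actual_tokens=num_input_tokens,
                                  slot_mapping=slot_mapping)

    def execute_model(self, num_scheduled_tokens: int) -> AttnMetadataSketch:
        # The fix: compute the padded size first, then build metadata from it.
        num_input_tokens = self._pad_for_cudagraph(num_scheduled_tokens)
        return self._prepare_inputs(num_scheduled_tokens, num_input_tokens)
```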
Force-pushed from c8c4642 to 6e87b60
LucasWilkinson left a comment
Thanks for taking this on! I think this might actually be a bit trickier than I initially thought; a couple of comments (see the sketch after this list):

- If we are dispatching to full cudagraphs, then we should pad `num_reqs` to match what the cudagraph was captured for, since this information is used by the attention backends too; this may require changes to the `CudagraphDispatcher`.
- If we are dispatching to a piecewise CUDA graph, we don't need to pad `num_reqs`, and it's probably better not to (to avoid wasted work in attention).
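A minimal sketch of the full-vs-piecewise distinction above, assuming a simplified dispatch enum. `num_reqs` and `CudagraphDispatcher` are named in the comment; the enum, the function, and the captured-size lookup are illustrative, not vLLM's actual interfaces:

```python
from enum import Enum, auto


class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()


def maybe_pad_num_reqs(num_reqs: int, mode: CUDAGraphMode,
                       captured_sizes: list[int]) -> int:
    """Pad num_reqs only when replaying a full CUDA graph.

    Full graphs bake the request count into the captured attention kernels,
    so num_reqs must match a captured size. Piecewise graphs run attention
    eagerly, so padding num_reqs would only add wasted attention work.
    """
    if mode is not CUDAGraphMode.FULL:
        return num_reqs
    for size in sorted(captured_sizes):
        if size >= num_reqs:
            return size
    return num_reqs  # No captured graph is large enough; run eagerly.
```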
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the review @LucasWilkinson. I will work on the suggested changes. Do you have any idea why I am getting CI failures on this consistently?
The CI can be flaky; I would just keep merging/rebasing off main pretty regularly; tests are usually fixed on main in pretty short order.
Force-pushed from 6e87b60 to b083990
Force-pushed from 098e689 to 6691ed4
@LucasWilkinson I have accommodated your suggestions in the new changes. Can you please take a look? Thanks!
Force-pushed from 8241342 to 6d8a083
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…ata' into cudagraph-fix
- Add missing Optional import to forward_context.py
- Add missing NamedTuple import to cudagraph_dispatcher.py
- Add missing Optional import to gpu_model_runner.py
- Fix indentation in gpu_model_runner.py (try-except block)
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Documentation preview: https://vllm--24002.org.readthedocs.build/en/24002/
This pull request has merge conflicts that must be resolved before it can be merged.
Fixes #23789. Move CUDA graph padding before attention metadata construction to ensure tensor sizes match the execution environment. Fixes crashes in FlashMLA and other attention backends caused by metadata/execution size mismatches.
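To make the failure mode concrete, here is a hedged repro sketch (not vLLM code) of the size-mismatch class this PR fixes, assuming a backend that validates metadata length against the input batch; the function and numbers are illustrative:

```python
import torch


def attention_with_size_check(hidden: torch.Tensor,
                              slot_mapping: torch.Tensor) -> torch.Tensor:
    # Backends such as FlashMLA expect metadata to cover every input token;
    # a shortfall surfaces as an assertion/crash at graph replay time.
    assert slot_mapping.numel() == hidden.shape[0], (
        f"slot_mapping has {slot_mapping.numel()} entries for "
        f"{hidden.shape[0]} input tokens")
    return hidden


num_scheduled, num_padded = 13, 16
hidden = torch.zeros(num_padded, 8)  # CUDA graph input uses the padded size
before_fix = torch.zeros(num_scheduled, dtype=torch.int64)  # built pre-padding
after_fix = torch.zeros(num_padded, dtype=torch.int64)      # built post-padding

attention_with_size_check(hidden, after_fix)    # OK: sizes match
# attention_with_size_check(hidden, before_fix)  # AssertionError: 13 vs 16
```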